Abstract

We develop a coherent framework for integrative
simultaneous analysis of the exploration-exploitation
and model order selection trade-offs. We improve over
our preceding results on the same subject (Seldin et al.,
2011) by combining PAC-Bayesian analysis with a
Bernstein-type inequality for martingales. Such a
combination is also of independent interest for studies
of multiple simultaneously evolving martingales.



1. Introduction 

The trade-off between exploration and exploitation 
is a fundamental question in reinforcement learning. 
Model order selection, which is a trade-off between
model complexity and its empirical data fit, is an even
more basic question in machine learning. To the best
of our knowledge, we develop the first framework that
makes it possible to consider these two trade-offs
simultaneously from a finite sample perspective. The importance of
simultaneous consideration of the two trade-offs can 
be illustrated by the following simple example. Imag- 
ine we have a web page, where we can show a visi- 
tor a single advertisement out of a pool of advertise- 
ments. Assume that we are given access to additional 
side information about the visitors, which we are
allowed to use in our choice of advertisements (this is
generally known as the contextual bandits problem).
Further, imagine that the amount of available (contextual)
side information is very large (and potentially
unlimited). Considering all side information from the 
beginning will result in an overcomplicated model that 
will take prohibitively many trials to learn. Instead, 
similar to supervised learning, we should start with a 
simple model and increase its complexity as our experi- 
ence grows. However, unlike in supervised learning, we 
have to learn under limited feedback. This means that 
the model order selection trade-off has to be consid- 
ered simultaneously with the exploration-exploitation 
trade-off. We develop an integrative framework that 
provides finite sample guarantees for both trade-offs 
simultaneously. 

Our solution is based on extending PAC-Bayesian
analysis of supervised learning with i.i.d. samples to
problems with limited feedback and sequentially
dependent samples. PAC-Bayesian analysis was introduced
over a decade ago (Shawe-Taylor & Williamson,
1997; Shawe-Taylor et al., 1998; McAllester, 1998;
Seeger, 2002) and has since made a significant
contribution to the analysis and development of
supervised learning methods. The power of the
PAC-Bayesian approach lies in its successful marriage of
the flexibility and intuitiveness of Bayesian models with
the rigor of PAC analysis. PAC-Bayesian bounds
provide an explicit and often intuitive and
easy-to-optimize trade-off between model complexity and
empirical data fit, where the complexity can be
nailed down to the resolution of individual hypotheses
via the prior definition. PAC-Bayesian
analysis was applied to derive generalization bounds
and new algorithms for linear classifiers and maximum
margin methods (Langford & Shawe-Taylor,
2002; McAllester, 2003; Germain et al., 2009),
structured prediction (McAllester, 2007), and
clustering-based classification models (Seldin & Tishby, 2010), to
name just a few. However, the application of
PAC-Bayesian analysis beyond the supervised learning
domain remained surprisingly limited. In fact, the only
additional domain known to us is density estimation
(Seldin & Tishby, 2010; Higgs & Shawe-Taylor, 2010).

Some potential advantages of applying PAC-Bayesian
analysis in reinforcement learning were
recently pointed out by several researchers, including
Tishby & Polani (2010) and Fard & Pineau (2010).
Tishby & Polani (2010) suggested that the mutual
information between states and actions in a policy
can be used as a natural regularizer in reinforcement
learning. They showed that regularization by mutual
information can be incorporated into Bellman
equations and thereby computed efficiently. Tishby and
Polani conjectured that PAC-Bayesian analysis can
be applied to justify this form of regularization and
provide generalization guarantees for it.

Fard & Pineau (2010) suggested a PAC-Bayesian anal- 
ysis of batch reinforcement learning. However, 
batch reinforcement learning does not involve the 
exploration-exploitation trade-off. 

One of the reasons for the difficulty of applying 
PAC-Bayesian analysis to address the exploration- 
exploitation trade-off is the limited feedback (the fact 
that we only observe the reward for the action taken, 
but not for all the rest). In supervised learning (and 
also in density estimation) the empirical error for each 
hypothesis within a hypothesis class can be evaluated
on all the samples and therefore the size of the sam- 
ple available for evaluation of all the hypotheses is the 
same (and usually relatively large). In the situation 
of limited feedback the sample from one action can- 
not be used to evaluate another action and the sample 
size of "bad" actions has to increase sublinearly in the 
number of game rounds. In (Seldin et al., 2011) we
resolved this issue by applying a weighted sampling
strategy (Sutton & Barto, 1998), which is commonly used
in the analysis of non-stochastic bandits (Auer et al., 
2002), but has not been applied to the analysis of 
stochastic bandits previously. 

The usage of weighted sampling introduces two new
difficulties. One is the sequential dependence of the
samples: the rewards we observe influence the distribution
over actions we play and through this distribution
influence the variance of the subsequent weighted sample
variables. In (Seldin et al., 2011) we handled this
dependence by combining PAC-Bayesian analysis with
Hoeffding-Azuma-type inequalities for martingales.

The second problem introduced by weighted sampling
is the growing variance of the weighted sample
variables. We did not succeed in taking full control over
the variance in (Seldin et al., 2011) and the bound
we obtained there depended on $1/\varepsilon_t$, where $\varepsilon_t$ is the
minimal probability for sampling any action at time
step $t$. Here we improve this dependence to $1/\sqrt{\varepsilon_t}$
by combining PAC-Bayesian analysis with a
Bernstein-type inequality for martingales. This improvement
makes it possible to tighten the regret bounds from
$\tilde{O}(K^{1/2} t^{3/4})$ to $\tilde{O}(K^{1/3} t^{2/3})$, where $K$ is the
number of arms and $t$ is the game round. The combination of
PAC-Bayesian analysis with a Bernstein-type inequality for
martingales is also of independent interest for studies of
multiple simultaneously evolving martingales.

At the end of Section 2 we suggest possible ways
to tighten the analysis further to get $\tilde{O}(\sqrt{Kt})$ regret
bounds. These further improvements will be studied
in detail in future work.

We emphasize that although this paper is focused
on the multiarmed bandit problem, our main goal is
not improving existing bounds for stochastic
multiarmed bandits, which are already tight up to $\sqrt{\ln(K)}$
factors (Audibert & Bubeck, 2009; Auer & Ortner,
2010), but rather developing a new powerful tool for
reinforcement learning in domains with a richer
structure. For example, Beygelzimer et al. (2010) suggested
$O(\sqrt{Kt \ln(N/\delta)})$ and $O(\sqrt{t(d \ln t - \ln \delta)})$ regret
bounds for learning with expert advice in the bandit
setting, where $N$ is the number of experts (in case
it is finite) and $d$ is the VC-dimension of the set of
experts (in case it is infinite). We believe that
PAC-Bayesian analysis should make it possible to replace the
$\ln(N)$ and $d$ factors with $\mathrm{KL}(\rho\|\mu)$, where $\rho(h)$ is a
distribution over experts played by the algorithm, $\mu(h)$ is a
prior distribution over experts that, for example, can reflect
their complexity, and KL is the KL-divergence. Such
an approach is much more flexible, since it allows
individual treatment of different experts (or policies) via
the prior definition $\mu$ and can be applied to both finite
and infinite policy spaces (or expert sets). Our
experience in supervised learning shows that PAC-Bayesian
analysis is also handy for treating tree-shaped graphical
models (since the KL-divergence decomposes into a sum
of KL-s according to the tree structure). This
property can also be useful for contextual bandits and other
reinforcement learning problems.

The subsequent sections are organized as follows: Sec- 
tion 2 surveys the main results of the paper and Sec- 
tion 3 discusses the results. All the proofs are provided 
in the appendix. 

2. Main Results 

We start with a general concentration result for
martingales, which is based on a combination of
PAC-Bayesian analysis with a Bernstein-type inequality for
martingales. We apply this result to derive an
instantaneous (per-round) generalization bound for the
multiarmed bandit problem. This result is in turn applied
to derive an instantaneous regret bound for
multiarmed bandits.

2.1. PAC-Bayes-Bernstein Inequality for 
Martingales 

In order to present our concentration result for
martingales we need a few definitions. Let $\mathcal{H}$ be an
index (or a hypothesis) space, possibly uncountably
infinite. Let $\{X_1(h), X_2(h), \dots\}$ be martingale
difference sequences, meaning that $E[X_t(h) | \mathcal{T}_{t-1}] = 0$,
where $\mathcal{T}_t = \{X_\tau(h)\}_{1 \le \tau \le t,\, h \in \mathcal{H}}$ is the set of martingale
differences observed up to time $t$. ($\{X_t(h)\}_{h \in \mathcal{H}}$ do
not have to be independent; we only need the
requirement on the conditional expectation to be
satisfied.) Let $M_t(h) = \sum_{\tau=1}^{t} X_\tau(h)$ be martingales. Let
$V_t(h) = \sum_{\tau=1}^{t} E[X_\tau(h)^2 | \mathcal{T}_{\tau-1}]$ be cumulative variances
of the martingales. For a distribution $\rho$ over $\mathcal{H}$ define
$M_t(\rho) = E_{\rho(h)}[M_t(h)]$ and $V_t(\rho) = E_{\rho(h)}[V_t(h)]$.

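Theorem 1 (PAC-Bayes-Bernstein Inequality). Assume
that $|X_t(h)| \le C$ for all $t$ and $h$. Let $\{\mu_1, \mu_2, \dots\}$
be a sequence of "reference" ("prior") distributions
over $\mathcal{H}$, such that $\mu_t$ is independent of $\mathcal{T}_t$ (but can
depend on $t$). Let $\{\bar V_1, \bar V_2, \dots\}$ be a sequence of
arbitrary numbers, such that $\bar V_t$ is independent of $\mathcal{T}_t$ (but
can depend on $t$) and satisfies:

$$\sqrt{\frac{L_t}{(e-2)\bar V_t}} \le \frac{1}{C}, \qquad (1)$$

where

$$L_t = 2\ln(t+1) + \ln\frac{2}{\delta}.$$

Then for all possible distributions $\rho_t$ over $\mathcal{H}$ given $t$
and for all $t$ simultaneously with probability greater than $1 - \delta$:

$$|M_t(\rho_t)| \le \sqrt{e-2}\left(\mathrm{KL}(\rho_t\|\mu_t)\sqrt{\frac{\bar V_t}{L_t}} + V_t(\rho_t)\sqrt{\frac{L_t}{\bar V_t}} + \sqrt{L_t \bar V_t}\right).$$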
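As an illustration (not part of the original analysis), the right-hand side of Theorem 1 can be evaluated numerically. The sketch below assumes the bound has the form $\sqrt{e-2}\,(\mathrm{KL}\sqrt{\bar V_t/L_t} + V_t(\rho_t)\sqrt{L_t/\bar V_t} + \sqrt{L_t \bar V_t})$ with $L_t = 2\ln(t+1) + \ln(2/\delta)$; the function name and argument names are hypothetical.

```python
import math

def pac_bayes_bernstein_bound(kl, v_rho, v_bar, t, delta, c):
    """Right-hand side of the Theorem 1 bound on |M_t(rho_t)|.

    kl    : KL(rho_t || mu_t)
    v_rho : cumulative variance V_t(rho_t)
    v_bar : the reference value V-bar_t, fixed independently of the data
    t     : round index
    delta : confidence parameter
    c     : bound C on |X_t(h)|
    """
    l_t = 2.0 * math.log(t + 1) + math.log(2.0 / delta)
    # lambda_t implied by the choice of V-bar_t; condition (1) asks lambda_t <= 1/C
    lam = math.sqrt(l_t / ((math.e - 2.0) * v_bar))
    if lam > 1.0 / c:
        raise ValueError("condition (1) violated: lambda_t must lie in [0, 1/C]")
    return math.sqrt(math.e - 2.0) * (
        kl * math.sqrt(v_bar / l_t)
        + v_rho * math.sqrt(l_t / v_bar)
        + math.sqrt(l_t * v_bar)
    )
```

A larger $\bar V_t$ makes condition (1) easier to satisfy but inflates the first and third terms, which is why the choice of $\bar V_t$ matters in the applications.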



2.2. Application to the Multiarmed Bandit 
Problem 

In order to apply our result to the multiarmed
bandit problem we need some more definitions. Let $\mathcal{A}$
be a set of actions (arms) of size $|\mathcal{A}| = K$ and let
$a \in \mathcal{A}$ denote the actions. Denote by $R(a)$ the
expected reward of action $a$. Let $\pi_t$ be a distribution
over $\mathcal{A}$ that is played at round $t$ of the game. Let
$\{A_1, A_2, \dots\}$ be the sequence of actions played
independently at random according to $\{\pi_1, \pi_2, \dots\}$ respectively.
Let $\{R_1, R_2, \dots\}$ be the sequence of observed rewards.
Denote by $\mathcal{T}_t = \{\{A_1, .., A_t\}, \{R_1, .., R_t\}\}$ the set of
taken actions and observed rewards up to round $t$ (by
definition $\mathcal{T}_{t-1} \subset \mathcal{T}_t$).

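For $t \ge 1$ and $a \in \{1,..,K\}$ define a set of random
variables $R_t^a$:

$$R_t^a = \begin{cases} \frac{1}{\pi_t(a)} R_t, & \text{if } A_t = a \\ 0, & \text{otherwise.} \end{cases}$$

Define:

$$\hat R_t(a) = \frac{1}{t} \sum_{\tau=1}^{t} R_\tau^a. \qquad (2)$$

Observe that $E \hat R_t(a) = R(a)$.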
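The unbiasedness of the importance-weighted samples can be checked directly. The following sketch is an illustration under our own naming (both function names are hypothetical): it computes the weighted sample, equal to the observed reward divided by the sampling probability when the action is played and zero otherwise, and then takes the exact expectation over the played action.

```python
def weighted_reward(reward, played, a, pi):
    """Importance-weighted sample for action a: reward / pi[a] if a was played, else 0."""
    return reward / pi[a] if played == a else 0.0

def expected_weighted_reward(a, pi, mean_rewards):
    """Exact expectation of the weighted sample over the played action A_t ~ pi.

    The expectation is linear in the reward, so the mean reward of the played
    arm can stand in for the random observed reward.
    """
    return sum(
        pi[b] * weighted_reward(mean_rewards[b], b, a, pi)
        for b in range(len(pi))
    )
```

Only the term with the played action equal to `a` survives in the sum, and the factor `pi[a]` cancels, which is the reason the estimate is unbiased for any sampling distribution with full support.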

Let $a^*$ be the best action (the action with the highest
expected reward; if there are multiple "best" actions
pick any of them). Define:

$$\Delta(a) = R(a^*) - R(a),$$
$$\hat\Delta_t(a) = \hat R_t(a^*) - \hat R_t(a).$$

Observe that $t(\hat\Delta_t(a) - \Delta(a))$ form a martingale. Let

$$W_t(a) = \sum_{\tau=1}^{t} E\left[\left([R_\tau^{a^*} - R_\tau^{a}] - [R(a^*) - R(a)]\right)^2 \,\middle|\, \mathcal{T}_{\tau-1}\right]$$

be the cumulative variance of this martingale.

Let $\{\varepsilon_1, \varepsilon_2, \dots\}$ be a decreasing sequence that satisfies
$\varepsilon_t \le \min_a \pi_t(a)$. In the appendix we prove the
following upper bound on $W_t(a)$.

Lemma 1. For all $a$:

$$W_t(a) \le \frac{2t}{\varepsilon_t}.$$

For a distribution $\rho$ over $\mathcal{A}$ define $\Delta(\rho) = E_{\rho(a)}[\Delta(a)]$
and $\hat\Delta_t(\rho) = E_{\rho(a)}[\hat\Delta_t(a)]$. The following theorem
follows immediately from Theorem 1 and Lemma 1 by
taking $\bar V_t = \frac{2t}{\varepsilon_t}$.
Theorem 2. For any sequence of sampling distributions
$\{\pi_1, \pi_2, \dots\}$ that are bounded from below by a
decreasing sequence $\{\varepsilon_1, \varepsilon_2, \dots\}$ that satisfies

$$\frac{L_t}{2(e-2)t} \le \varepsilon_t, \qquad (3)$$

where $\pi_t$ can depend on $\mathcal{T}_{t-1}$, and for any sequence
of "reference" distributions $\{\mu_1, \mu_2, \dots\}$ over $\mathcal{A}$, such
that $\mu_t$ is independent of $\mathcal{T}_t$ (but can depend on $t$), for
all possible distributions $\rho_t$ given $t$ and for all $t \ge 1$
simultaneously with probability greater than $1 - \delta$:

$$\left|\Delta(\rho_t) - \hat\Delta_t(\rho_t)\right| \le \mathrm{KL}(\rho_t\|\mu_t)\sqrt{\frac{2(e-2)}{t \varepsilon_t L_t}} + 2\sqrt{\frac{2(e-2) L_t}{t \varepsilon_t}}. \qquad (4)$$
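The step from Theorem 1 to Theorem 2 can be verified numerically. The sketch below is illustrative only and uses hypothetical function names; it assumes that (4) reads $\mathrm{KL}\sqrt{2(e-2)/(t\varepsilon_t L_t)} + 2\sqrt{2(e-2)L_t/(t\varepsilon_t)}$, that condition (3) reads $L_t/(2(e-2)t) \le \varepsilon_t$, and that Theorem 1 is applied with the reference value $2t/\varepsilon_t$ from Lemma 1 in place of both $\bar V_t$ and $V_t(\rho_t)$.

```python
import math

def l_t(t, delta):
    """L_t = 2 ln(t + 1) + ln(2 / delta)."""
    return 2.0 * math.log(t + 1) + math.log(2.0 / delta)

def condition_3_holds(t, eps_t, delta):
    """Technical condition (3): L_t / (2 (e - 2) t) <= eps_t."""
    return l_t(t, delta) / (2.0 * (math.e - 2.0) * t) <= eps_t

def theorem2_bound(kl, t, eps_t, delta):
    """Right-hand side of (4), bounding the deviation of the empirical gap."""
    lt = l_t(t, delta)
    c = 2.0 * (math.e - 2.0)
    return kl * math.sqrt(c / (t * eps_t * lt)) + 2.0 * math.sqrt(c * lt / (t * eps_t))

def theorem1_specialized(kl, t, eps_t, delta):
    """Theorem 1 with V-bar_t = 2t/eps_t, V_t(rho_t) bounded by 2t/eps_t, divided by t."""
    lt = l_t(t, delta)
    v_bar = 2.0 * t / eps_t
    rhs = math.sqrt(math.e - 2.0) * (
        kl * math.sqrt(v_bar / lt)
        + v_bar * math.sqrt(lt / v_bar)
        + math.sqrt(lt * v_bar)
    )
    return rhs / t
```

The two functions agree up to floating point error, and both scale as $1/\sqrt{\varepsilon_t}$ in the second term.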

Theorem 2 provides an improvement over the corresponding
Theorems 2 and 3 in (Seldin et al., 2011) by
decreasing the dependence on $\varepsilon_t$ from $1/\varepsilon_t$ to $1/\sqrt{\varepsilon_t}$.
This in turn allows us to improve the regret bound, which
is shown next.



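Theorem 3. For $t < K$ let $\pi_t(a) = \frac{1}{K}$ for all $a$. Let
$\gamma_t = K^{-1/3} t^{1/3} \sqrt{\ln K}$ and $\varepsilon_t = K^{-2/3} t^{-1/3}$ and for
$t \ge (K-1)$ let

$$\pi_{t+1}(a) = \tilde\rho_{t+1}^{exp}(a) = (1 - K\varepsilon_{t+1})\rho_t^{exp}(a) + \varepsilon_{t+1}, \qquad (5)$$

where

$$\rho_t^{exp}(a) = \frac{1}{Z(\rho_t^{exp})} e^{\gamma_t \hat R_t(a)} \qquad (6)$$

and

$$Z(\rho_t^{exp}) = \sum_a e^{\gamma_t \hat R_t(a)}.$$

Then for $t \ge \max\left\{K, K^{4(e-2)}\sqrt{\frac{\delta}{2}}\right\}$ and satisfying (3)
(which means that $2\ln(t+1) + \ln\frac{2}{\delta} \le 2(e-2)\left(\frac{t}{K}\right)^{2/3}$)
the per-round regret $R(a^*) - R(\tilde\rho_{t+1}^{exp})$ is bounded by:

$$R(a^*) - R(\tilde\rho_{t+1}^{exp}) \le \frac{K^{1/3}}{t^{1/3}}\left((16(e-2)+1)\sqrt{\ln K} + 2\sqrt{2(e-2)L_t}\right) + \frac{K^{1/3}}{(t+1)^{1/3}}$$

with probability greater than $1 - \delta$ for all rounds $t$
simultaneously. This translates into a total regret of
$\tilde{O}(K^{1/3} t^{2/3})$ (where $\tilde{O}$ hides logarithmic factors).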
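The playing strategy of Theorem 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: we read the learning rate as $K^{-1/3} t^{1/3} \sqrt{\ln K}$ and the exploration floor as $K^{-2/3} t^{-1/3}$, we mix exponential weights on the importance-weighted reward estimates with the floor as in (5), and we simulate Bernoulli rewards (the reward model, the function names and the `seed` argument are our own choices).

```python
import math
import random

def gamma(t, k):
    """Learning rate gamma_t (as read from Theorem 3)."""
    return k ** (-1.0 / 3.0) * t ** (1.0 / 3.0) * math.sqrt(math.log(k))

def epsilon(t, k):
    """Exploration floor eps_t (as read from Theorem 3)."""
    return k ** (-2.0 / 3.0) * t ** (-1.0 / 3.0)

def next_sampling_distribution(r_hat, t):
    """Sampling distribution for round t + 1 given the estimates after round t."""
    k = len(r_hat)
    g = gamma(t, k)
    top = max(r_hat)  # subtracting the max leaves the ratio unchanged, avoids overflow
    w = [math.exp(g * (r - top)) for r in r_hat]
    z = sum(w)
    eps = epsilon(t + 1, k)
    return [(1.0 - k * eps) * wi / z + eps for wi in w]

def play(mean_rewards, rounds, seed=0):
    """Run the strategy on Bernoulli arms and return the number of pulls per arm."""
    rng = random.Random(seed)
    k = len(mean_rewards)
    sums = [0.0] * k   # running sums of the importance-weighted samples
    pulls = [0] * k
    pi = [1.0 / k] * k  # uniform play in the first rounds
    for t in range(1, rounds + 1):
        a = rng.choices(range(k), weights=pi)[0]
        reward = 1.0 if rng.random() < mean_rewards[a] else 0.0
        sums[a] += reward / pi[a]
        pulls[a] += 1
        if t >= k - 1:
            pi = next_sampling_distribution([s / t for s in sums], t)
    return pulls
```

For example, `play([0.2, 0.8, 0.5], 2000)` returns the pull counts of the three arms; because every arm keeps probability at least the exploration floor, the importance weights stay bounded by its inverse.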

Theorem 3 improves the dependence on $t$ and $K$ from
$\tilde{O}(K^{1/2} t^{3/4})$ in (Seldin et al., 2011) to $\tilde{O}(K^{1/3} t^{2/3})$.
This improvement is due to the better concentration result
in Theorem 2 (which is based on Theorem 1).

We note that there is still room for improvement,
which we believe will make it possible to achieve regret bounds
of $\tilde{O}(\sqrt{Kt})$. The main source of looseness is the
usage of the crude global upper bound $\frac{2t}{\varepsilon_t}$ on the
cumulative variances that holds for any distribution $\rho_t$.



It is possible to show that if we play according to the
distributions $\{\tilde\rho_1^{exp}, .., \tilde\rho_t^{exp}\}$, then for "good" actions $a$
(those for which $\Delta(a) \le \frac{1}{\gamma_t}$) the cumulative variance
$W_t(a)$ is bounded by $CKt$ for some constant $C$. If we
could show that for "bad" actions $a$ (those for which
$\Delta(a) > \frac{1}{\gamma_t}$) the probability $\tilde\rho_t^{exp}$ of picking such
actions is bounded by $C\varepsilon_t/K$, then the cumulative
variance $W_t(\tilde\rho_t^{exp})$ would be bounded by $CKt$. This is,
in fact, true for "very bad" actions (those for which
$\Delta(a)$ is close to 1) and it is also possible to show that
it holds for $\mu_t^{exp}$ (and hence $W_t(\mu_t^{exp}) \le CKt$), but
it does not hold for actions with $\Delta(a)$ close to $\frac{1}{\gamma_t}$.
However, we can possibly show that for such actions
$\tilde\rho_t^{exp}(a) \le C\varepsilon_t/K$ for most of the rounds (a $1 - \varepsilon_t$
fraction should suffice) and then we will be able to achieve
$\tilde{O}(\sqrt{Kt})$ regret. This research direction will be
explored in more detail in future work.

3. Discussion 

We presented an improved PAC-Bayesian analysis of
martingales that is based on a combination of the
PAC-Bayesian bound with a Bernstein-type inequality for
martingales. The new bound makes it possible to provide
better finite sample generalization and regret guarantees
for the exploration-exploitation and model order selection
trade-offs simultaneously. There are several important
and fascinating research directions that take root in
our result.

First, our concentration result for martingales can be 
of interest in any study of multiple simultaneously 
evolving and possibly interdependent martingales, es- 
pecially when the number of martingales is uncount- 
ably infinite and standard union bounds cannot be ap- 
plied. Just as an example, our result can be applied 
to derive new generalization bounds for active learning 
(Beygelzimer et al., 2009). 

Another important direction is to tighten Theorems 2 
and 3, so that the regret bound will match state-of-the- 
art regret bounds obtained by alternative techniques. 
We believe that the ideas mentioned at the end of the 
previous section can make it possible. 

Once we have a bound that matches state-of-the-art
regret bounds we can extend the technique to
richer problems with a large or infinite number of
states, such as contextual bandits (Beygelzimer et al.,
2010), or a large or infinite number of actions, such
as Gaussian process bandits (Srinivas et al., 2010).
Through the definition of appropriate priors over
hypothesis spaces, the PAC-Bayesian approach should make it
possible to obtain bounds that involve natural measures of model
complexity, such as the mutual information between states
and actions in contextual bandits. Such a measure of
model complexity is more flexible than the plain number of
experts or the VC-dimension used in (Beygelzimer et al.,
2010) since it makes it possible to differentiate between
complexities of individual hypotheses. A similar analysis was
already performed and proved successful in the
context of co-clustering in supervised and unsupervised
learning (Seldin & Tishby, 2010).

A. Proofs 

In this appendix we provide the proofs of Theorems 1 
and 3 and Lemma 1. 

A.1. Proof of Theorem 1

The proof of Theorem 1 relies on the following two lem- 
mas. The first one is a Bernstein-type inequality, see 
the proof of Theorem 1 in (Beygelzimer et al., 2010) 
for a proof. 

Lemma 2 (Bernstein's inequality). Let $X_1,..,X_t$
be a martingale difference sequence (meaning that
$E[X_\tau | X_1,..,X_{\tau-1}] = 0$ for all $\tau$), such that $X_\tau \le C$
for all $\tau$. Let $M_t = \sum_{\tau=1}^{t} X_\tau$ be the corresponding
martingale and $V_t = \sum_{\tau=1}^{t} E[X_\tau^2 | X_1,..,X_{\tau-1}]$ be the
cumulative variance of this martingale. Then for any
fixed $\lambda \in [0, \frac{1}{C}]$:

$$E e^{\lambda M_t - (e-2)\lambda^2 V_t} \le 1.$$



The second lemma originates in statistical physics 
and information theory (Donsker & Varadhan, 1975; 
Dupuis & Ellis, 1997; Gray, 2011) and forms the basis 
of PAC-Bayesian analysis. See (Banerjee, 2006) for a 
proof. 

Lemma 3 (Change of measure inequality). For any
measurable function $\phi(h)$ on $\mathcal{H}$ and any distributions
$\mu(h)$ and $\rho(h)$ on $\mathcal{H}$, we have:

$$E_{\rho(h)}[\phi(h)] \le \mathrm{KL}(\rho\|\mu) + \ln E_{\mu(h)}\left[e^{\phi(h)}\right].$$



Now we are ready to state the proof of Theorem 1. 



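Proof of Theorem 1. Take $\phi(h) = \lambda_t M_t(h) - (e-2)\lambda_t^2 V_t(h)$
and $\delta_t = \frac{1}{t(t+1)}\delta \ge \frac{1}{(t+1)^2}\delta$. (It is well-known
that $\sum_{t=1}^{\infty} \frac{1}{t(t+1)} = \sum_{t=1}^{\infty} \left(\frac{1}{t} - \frac{1}{t+1}\right) = 1$.)
Then the following holds for all $\rho_t$ and $t$ simultaneously
with probability greater than $1 - \frac{\delta}{2}$:

$$\begin{aligned}
\lambda_t M_t(\rho_t) - (e-2)\lambda_t^2 V_t(\rho_t)
&= E_{\rho_t(h)}\left[\lambda_t M_t(h) - (e-2)\lambda_t^2 V_t(h)\right] && (7) \\
&\le \mathrm{KL}(\rho_t\|\mu_t) + \ln E_{\mu_t(h)}\left[e^{\lambda_t M_t(h) - (e-2)\lambda_t^2 V_t(h)}\right] && (8) \\
&\le \mathrm{KL}(\rho_t\|\mu_t) + 2\ln(t+1) + \ln\frac{2}{\delta} + \ln E_{\mathcal{T}_t} E_{\mu_t(h)}\left[e^{\lambda_t M_t(h) - (e-2)\lambda_t^2 V_t(h)}\right] && (9) \\
&= \mathrm{KL}(\rho_t\|\mu_t) + L_t + \ln E_{\mu_t(h)} E_{\mathcal{T}_t}\left[e^{\lambda_t M_t(h) - (e-2)\lambda_t^2 V_t(h)}\right] && (10) \\
&\le \mathrm{KL}(\rho_t\|\mu_t) + L_t, && (11)
\end{aligned}$$

where (7) is by definition of $M_t(\rho_t)$ and $V_t(\rho_t)$, (8) is
by Lemma 3, (9) holds with probability greater than
$1 - \frac{\delta}{2}$ by Markov's inequality and a union bound over
$t$, (10) is due to the fact that $\mu_t$ is independent of $\mathcal{T}_t$
and by definition of $L_t$, and (11) is by Lemma 2.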

By applying the same argument to martingales
$-M_t(h)$ and taking a union bound over the two we
obtain that with probability greater than $1 - \delta$:

$$|M_t(\rho_t)| \le \frac{\mathrm{KL}(\rho_t\|\mu_t) + L_t}{\lambda_t} + (e-2)\lambda_t V_t(\rho_t). \qquad (12)$$

By taking

$$\lambda_t = \sqrt{\frac{L_t}{(e-2)\bar V_t}}$$

and substituting into (12) we obtain the bound of Theorem 1. The technical
condition (1) follows from the requirement that $\lambda_t \in [0, \frac{1}{C}]$. □
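For completeness, the final substitution can be written out term by term. The following worked step is our own expansion, assuming the two-sided bound $|M_t(\rho_t)| \le (\mathrm{KL}(\rho_t\|\mu_t) + L_t)/\lambda_t + (e-2)\lambda_t V_t(\rho_t)$ and the choice $\lambda_t = \sqrt{L_t / ((e-2)\bar V_t)}$:

```latex
\begin{align*}
\frac{\mathrm{KL}(\rho_t\|\mu_t) + L_t}{\lambda_t}
  &= \left(\mathrm{KL}(\rho_t\|\mu_t) + L_t\right)\sqrt{\frac{(e-2)\bar V_t}{L_t}}
   = \sqrt{e-2}\left(\mathrm{KL}(\rho_t\|\mu_t)\sqrt{\frac{\bar V_t}{L_t}} + \sqrt{L_t \bar V_t}\right), \\
(e-2)\lambda_t V_t(\rho_t)
  &= (e-2)\sqrt{\frac{L_t}{(e-2)\bar V_t}}\, V_t(\rho_t)
   = \sqrt{e-2}\; V_t(\rho_t)\sqrt{\frac{L_t}{\bar V_t}}.
\end{align*}
```

Summing the two lines gives the three terms of the bound. Note that $\lambda_t$ has to be fixed independently of the data, which is why the deterministic value $\bar V_t$ rather than the random $V_t(\rho_t)$ appears in its definition.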

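A.2. Proof of Lemma 1

Proof of Lemma 1.

$$\begin{aligned}
W_t(a) &= \sum_{\tau=1}^{t} E\left[\left([R_\tau^{a^*} - R_\tau^{a}] - [R(a^*) - R(a)]\right)^2 \,\middle|\, \mathcal{T}_{\tau-1}\right] \\
&= \left(\sum_{\tau=1}^{t} E\left[(R_\tau^{a^*} - R_\tau^{a})^2 \,\middle|\, \mathcal{T}_{\tau-1}\right]\right) - t\Delta(a)^2 && (13) \\
&\le \left(\sum_{\tau=1}^{t} \left(\frac{\pi_\tau(a)}{\pi_\tau(a)^2} + \frac{\pi_\tau(a^*)}{\pi_\tau(a^*)^2}\right)\right) - t\Delta(a)^2 && (14) \\
&= \left(\sum_{\tau=1}^{t} \left(\frac{1}{\pi_\tau(a)} + \frac{1}{\pi_\tau(a^*)}\right)\right) - t\Delta(a)^2 \\
&\le \frac{2t}{\varepsilon_t}, && (15)
\end{aligned}$$

where (13) is due to the fact that $E[R_\tau^a | \mathcal{T}_{\tau-1}] = R(a)$,
(14) is due to the fact that $R_t \le 1$ and (15) is due to
the fact that $\frac{1}{\pi_\tau(a)} \le \frac{1}{\varepsilon_t}$ for all $a$ and $1 \le \tau \le t$. □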
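The per-round contribution to the cumulative variance can be computed exactly for a small example. The sketch below is illustrative and assumes Bernoulli rewards (an assumption of ours; the lemma only needs rewards bounded by 1). The function name is hypothetical. It enumerates the played action and the reward outcome, computes the second moment of the difference of the two importance-weighted samples, and subtracts the squared gap.

```python
def per_round_variance(pi, mean_rewards, a, a_star):
    """Exact one-round conditional variance of the weighted gap, for Bernoulli rewards."""
    gap = mean_rewards[a_star] - mean_rewards[a]
    second_moment = 0.0
    for played, p_play in enumerate(pi):
        outcomes = ((1.0, mean_rewards[played]), (0.0, 1.0 - mean_rewards[played]))
        for reward, p_reward in outcomes:
            # importance-weighted samples for the best action and for action a
            w_star = reward / pi[a_star] if played == a_star else 0.0
            w_a = reward / pi[a] if played == a else 0.0
            second_moment += p_play * p_reward * (w_star - w_a) ** 2
    return second_moment - gap ** 2
```

If every sampling probability is at least some floor, each round contributes at most two over that floor, so summing over $t$ rounds reproduces the $2t/\varepsilon_t$ bound of the lemma.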
A.3. Proof of Theorem 3

Proof of Theorem 3. We take the same prior $\mu_t^{exp}(a)$
that was used in (Seldin et al., 2011):

$$\mu_t^{exp}(a) = \frac{e^{\gamma_t R(a)}}{Z(\mu_t^{exp})}, \qquad (16)$$

where $Z(\mu_t^{exp}) = \sum_a e^{\gamma_t R(a)}$ is the normalization factor.

We reuse the same regret decomposition we had in
(Seldin et al., 2011), but write it in a new form using
$\Delta$-s:

$$\begin{aligned}
\Delta(\tilde\rho_{t+1}^{exp}) &= \Delta(\rho_t^{exp}) + [R(\rho_t^{exp}) - R(\tilde\rho_{t+1}^{exp})] \\
&\le [\Delta(\rho_t^{exp}) - \hat\Delta_t(\rho_t^{exp})] + \hat\Delta_t(\rho_t^{exp}) + K\varepsilon_{t+1} && (17) \\
&\le [\Delta(\rho_t^{exp}) - \hat\Delta_t(\rho_t^{exp})] + \frac{\ln K}{\gamma_t} + K\varepsilon_{t+1}, && (18)
\end{aligned}$$

where in (17) we used the bound on $[R(\rho_t^{exp}) - R(\tilde\rho_{t+1}^{exp})]$
obtained in (Seldin et al., 2011) and in (18) we used
Lemma 4 given below. Note that due to working with
$\Delta$-s we are left to bound only one term instead of the two
terms we had to bound in (Seldin et al., 2011).

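Lemma 4. Let $x_1 = 0$ and $x_2,..,x_n$ be $n - 1$ arbitrary
numbers. For any $\alpha > 0$ and $n \ge 2$:

$$\frac{\sum_{i=1}^{n} x_i e^{-\alpha x_i}}{\sum_{j=1}^{n} e^{-\alpha x_j}} \le \frac{\ln n}{\alpha}. \qquad (19)$$

Proof. Since negative $x_i$-s only decrease the left hand
side of (19) we can assume without loss of generality
that all $x_i$-s are positive. Due to symmetry, the
maximum is achieved when all $x_i$-s (except $x_1$) are equal:

$$\frac{\sum_{i=1}^{n} x_i e^{-\alpha x_i}}{\sum_{j=1}^{n} e^{-\alpha x_j}} \le \max_x \frac{(n-1) x e^{-\alpha x}}{1 + (n-1) e^{-\alpha x}}. \qquad (20)$$

We apply the change of variables $y = e^{-\alpha x}$, which means
that $x = \frac{1}{\alpha} \ln\frac{1}{y}$. By substituting this into the right
hand side of (20) we get

$$\frac{1}{\alpha} \cdot \frac{(n-1) y \ln\frac{1}{y}}{1 + (n-1) y}.$$

In order to prove the bound we have to show that

$$\frac{(n-1) y \ln\frac{1}{y}}{1 + (n-1) y} \le \ln n.$$

By taking the Taylor expansion of $\ln z$ around $z = n$ we
have:

$$\ln z \le \ln n + \frac{1}{n}(z - n) = \ln n + \frac{z}{n} - 1.$$

Thus:

$$\begin{aligned}
\frac{(n-1) y \ln\frac{1}{y}}{1 + (n-1) y}
&\le \frac{(n-1) y \left(\ln n + \frac{1}{ny} - 1\right)}{1 + (n-1) y} \\
&\le \frac{y(n-1) \ln n + \frac{n-1}{n}}{(n-1) y + 1} \\
&\le \frac{(y(n-1) + 1) \ln n}{y(n-1) + 1} = \ln n,
\end{aligned}$$

where the last inequality follows from the fact that
$\frac{n-1}{n} \le \ln n$ for $n \ge 2$. □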



In order to obtain an explicit bound on $[\Delta(\rho_t) - \hat\Delta_t(\rho_t)]$
we need an explicit bound on $\mathrm{KL}(\rho_t^{exp}\|\mu_t^{exp})$. To
obtain such a bound we modify the procedure that was
used in (Seldin et al., 2011), which in turn was based
on the procedure developed by Lever et al. (2010).
Due to the tighter concentration inequality in Theorem
1 we obtain a tighter bound on $\mathrm{KL}(\rho_t^{exp}\|\mu_t^{exp})$.

The derivation procedure starts with the following
lemma, which is proved similarly to Lemma 12 in
(Seldin et al., 2011).

Lemma 5. For $\mu_t^{exp}$ and $\rho_t^{exp}$ defined by (16) and (6):

$$\mathrm{KL}(\rho_t^{exp}\|\mu_t^{exp}) \le \gamma_t\left([\Delta(\rho_t^{exp}) - \hat\Delta_t(\rho_t^{exp})] + [\hat\Delta_t(\mu_t^{exp}) - \Delta(\mu_t^{exp})]\right).$$

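Proof. We use the following definitions:

$$Z'(\mu_t^{exp}) = \sum_a e^{-\gamma_t \Delta(a)} = \sum_a e^{-\gamma_t (R(a^*) - R(a))} = e^{-\gamma_t R(a^*)} Z(\mu_t^{exp}),$$

$$Z'(\rho_t^{exp}) = \sum_a e^{-\gamma_t \hat\Delta_t(a)} = \sum_a e^{-\gamma_t (\hat R_t(a^*) - \hat R_t(a))} = e^{-\gamma_t \hat R_t(a^*)} Z(\rho_t^{exp}).$$

The following identity is easily verified from the
definitions:

$$\frac{1}{Z'(\mu_t^{exp})} = \mu_t^{exp}(a) e^{-\gamma_t R(a)} e^{\gamma_t R(a^*)} = \mu_t^{exp}(a) e^{\gamma_t \Delta(a)}.$$

Now we have:

$$\begin{aligned}
\mathrm{KL}(\rho_t^{exp}\|\mu_t^{exp}) &= \sum_a \rho_t^{exp}(a) \ln\frac{\rho_t^{exp}(a)}{\mu_t^{exp}(a)} \\
&= \sum_a \rho_t^{exp}(a) \ln\frac{e^{-\gamma_t \hat\Delta_t(a)} Z'(\mu_t^{exp})}{e^{-\gamma_t \Delta(a)} Z'(\rho_t^{exp})} \\
&= \gamma_t [\Delta(\rho_t^{exp}) - \hat\Delta_t(\rho_t^{exp})] - \ln\left(\frac{1}{Z'(\mu_t^{exp})} \sum_a e^{-\gamma_t \hat\Delta_t(a)}\right) \\
&= \gamma_t [\Delta(\rho_t^{exp}) - \hat\Delta_t(\rho_t^{exp})] - \ln\left(\sum_a \mu_t^{exp}(a) e^{\gamma_t (\Delta(a) - \hat\Delta_t(a))}\right) \\
&\le \gamma_t \left([\Delta(\rho_t^{exp}) - \hat\Delta_t(\rho_t^{exp})] + [\hat\Delta_t(\mu_t^{exp}) - \Delta(\mu_t^{exp})]\right),
\end{aligned}$$

where the last step is by Jensen's inequality. □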



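Now we want to get an explicit upper bound on
$\mathrm{KL}(\rho_t^{exp}\|\mu_t^{exp})$. Note that for our choice of $\varepsilon_t$ the
technical condition (3) of Theorem 2 is satisfied by $t$ large
enough, so that

$$2\ln(t+1) + \ln\frac{2}{\delta} \le 2(e-2)\left(\frac{t}{K}\right)^{2/3}.$$

(This requirement is satisfied by $t = \tilde{O}\left(K \left(\ln\frac{1}{\delta}\right)^{3/2}\right)$.)
By Theorem 2 with probability greater than $1 - \delta$:

$$\Delta(\rho_t^{exp}) - \hat\Delta_t(\rho_t^{exp}) \le \mathrm{KL}(\rho_t^{exp}\|\mu_t^{exp})\sqrt{\frac{2(e-2)}{t \varepsilon_t L_t}} + 2\sqrt{\frac{2(e-2) L_t}{t \varepsilon_t}} \qquad (21)$$

and

$$\hat\Delta_t(\mu_t^{exp}) - \Delta(\mu_t^{exp}) \le 2\sqrt{\frac{2(e-2) L_t}{t \varepsilon_t}}.$$

By substituting this into Lemma 5 we obtain:

$$\mathrm{KL}(\rho_t^{exp}\|\mu_t^{exp}) \le \gamma_t\left(\mathrm{KL}(\rho_t^{exp}\|\mu_t^{exp})\sqrt{\frac{2(e-2)}{t \varepsilon_t L_t}} + 4\sqrt{\frac{2(e-2) L_t}{t \varepsilon_t}}\right).$$

By reorganizing the terms:

$$\mathrm{KL}(\rho_t^{exp}\|\mu_t^{exp})\left(1 - \gamma_t\sqrt{\frac{2(e-2)}{t \varepsilon_t L_t}}\right) \le 4\gamma_t\sqrt{\frac{2(e-2) L_t}{t \varepsilon_t}}. \qquad (22)$$

Note that for our choice of $\gamma_t$ and $\varepsilon_t$:

$$\gamma_t\sqrt{\frac{2(e-2)}{t \varepsilon_t L_t}} = \sqrt{\frac{2(e-2)\ln K}{2\ln(t+1) + \ln\frac{2}{\delta}}}.$$

By simple algebraic manipulations we obtain that

$$\gamma_t\sqrt{\frac{2(e-2)}{t \varepsilon_t L_t}} \le \frac{1}{2} \qquad (23)$$

for $t \ge K^{4(e-2)}\sqrt{\frac{\delta}{2}}$.

By substituting (23) into (22) we obtain that:

$$\mathrm{KL}(\rho_t^{exp}\|\mu_t^{exp}) \le 8\gamma_t\sqrt{\frac{2(e-2) L_t}{t \varepsilon_t}}.$$

By substituting this into (21) we obtain

$$\Delta(\rho_t^{exp}) - \hat\Delta_t(\rho_t^{exp}) \le 8\gamma_t \frac{2(e-2)}{t \varepsilon_t} + 2\sqrt{\frac{2(e-2) L_t}{t \varepsilon_t}}.$$

For our choice of $\gamma_t$ and $\varepsilon_t$:

$$\Delta(\rho_t^{exp}) - \hat\Delta_t(\rho_t^{exp}) \le \frac{K^{1/3}}{t^{1/3}}\left(16(e-2)\sqrt{\ln K} + 2\sqrt{2(e-2) L_t}\right).$$

Substitution of the result into (18) concludes the proof.

□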
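To get a feeling for the magnitudes involved, the three terms of the regret decomposition can be evaluated numerically. The sketch below is an illustration under our reading of the proof, with a hypothetical function name: it assumes a learning rate $K^{-1/3} t^{1/3} \sqrt{\ln K}$, an exploration floor $K^{-2/3} t^{-1/3}$, and a concentration term equal to $8\gamma_t \cdot 2(e-2)/(t\varepsilon_t) + 2\sqrt{2(e-2)L_t/(t\varepsilon_t)}$. It is meaningful only for $t$ large enough to satisfy the conditions of Theorem 3.

```python
import math

def regret_bound_terms(t, k, delta):
    """Return (concentration term, ln K / gamma_t, K * eps_{t+1}) for round t + 1."""
    l_t = 2.0 * math.log(t + 1) + math.log(2.0 / delta)
    gamma_t = k ** (-1 / 3) * t ** (1 / 3) * math.sqrt(math.log(k))
    eps_t = k ** (-2 / 3) * t ** (-1 / 3)
    eps_next = k ** (-2 / 3) * (t + 1) ** (-1 / 3)
    c = 2.0 * (math.e - 2.0)
    concentration = (
        8.0 * gamma_t * c / (t * eps_t)
        + 2.0 * math.sqrt(c * l_t / (t * eps_t))
    )
    return concentration, math.log(k) / gamma_t, k * eps_next
```

All three terms decay as $t^{-1/3}$ up to logarithmic factors, so summing them over rounds gives a total of order $K^{1/3} t^{2/3}$. The constants are large, and for moderate $t$ the sum can exceed 1, in which case the bound is vacuous for rewards in $[0, 1]$.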